A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."
- To learn more about performing a search query against a percolate table, see the section Percolate query.
- To learn how to prepare a table for searching, see the section Adding rules to a percolate table.
The schema of a percolate table is fixed and contains the following fields:
Field | Description |
---|---|
ID | An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule |
Query | Full-text query of the rule, which can be thought of as the value of a MATCH clause or JSON /search. If per-field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is intended only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Filters | Optional. A string containing attribute filters and/or expressions, defined the same way as in a WHERE clause or in JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Tags | Optional. A comma-separated list of string labels that can be used for filtering or deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query |
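For illustration, here is a minimal SQL sketch of a rule using these fields, assuming a percolate table like the products table created below (the filter expression and tags are illustrative):

INSERT INTO products (query, filters, tags) VALUES ('@title shoes', 'meta.price>100', 'footwear,sale');

The id is omitted here, so it is assigned automatically via auto-increment.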
Note that you do not need to add the above fields when creating a percolate table.
When creating a new percolate table, keep in mind that you need to specify the expected document schema, which will be checked against the rules you add later. This is done in the same way as for any other local table.
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- Typescript
- Go
- CONFIG
CREATE TABLE products(title text, meta json) type='pq';
POST /cli -d "CREATE TABLE products(title text, meta json) type='pq'"
$index = [
'table' => 'products',
'body' => [
'columns' => [
'title' => ['type' => 'text'],
'meta' => ['type' => 'json']
],
'settings' => [
'type' => 'pq'
]
]
];
$client->indices()->create($index);
utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'')
res = await utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'');
utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
utilsApi.Sql("CREATE TABLE products(title text, meta json) type='pq'");
res = await utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
apiClient.UtilsAPI.Sql(context.Background()).Body("CREATE TABLE products(title text, meta json) type='pq'").Execute()
table products {
type = percolate
path = tbl_pq
rt_field = title
rt_attr_json = meta
}
Query OK, 0 rows affected (0.00 sec)
{
"total":0,
"error":"",
"warning":""
}
Array(
[total] => 0
[error] =>
[warning] =>
)
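Once the table is created and rules are added, documents can be percolated against them, e.g. via CALL PQ in SQL. A minimal sketch (see the Percolate query section for the full syntax and options; the document below is illustrative):

CALL PQ('products', '{"title": "new red shoes", "meta": {"price": 150}}', 1 AS docs, 1 AS query);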
A template table is a special type of table in Manticore that doesn't store any data and doesn't create any files on your disk. Despite this, it can have the same NLP settings as a plain or real-time table. Template tables can be used for the following purposes:
- As a template to inherit settings in Plain mode, simplifying your Manticore configuration file.
- Keyword generation with the help of the CALL KEYWORDS command.
- Highlighting an arbitrary string using the CALL SNIPPETS command (see the sketch after the config example below).
- CONFIG
table template {
type = template
morphology = stem_en
wordforms = wordforms.txt
exceptions = exceptions.txt
stopwords = stopwords.txt
}
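For instance, a template table like the one above can be used to test tokenization and highlighting without indexing a single document. A minimal SQL sketch (the input strings are illustrative):

CALL KEYWORDS('running shoes', 'template');
CALL SNIPPETS('I bought a pair of running shoes yesterday', 'template', 'running shoes');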
⪢ NLP and tokenization
Manticore doesn't store text as-is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the extracted words, a dictionary is built, which allows a quick lookup to discover whether a word is present in the index. In addition, other structures record the documents and fields in which each word was found, as well as its positions within a field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and searching time, and it operates at the character and word levels.
At the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
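As a sketch, here is how these character-level settings might look inside a table definition (the values are illustrative, not recommendations):

charset_table = non_cjk, U+00E9->e  # keep the default non-CJK set, map "é" to "e"
ignore_chars = U+AD                 # silently skip soft hyphens
blend_chars = U+26                  # "&" is indexed both as part of a word and as a separator
phrase_boundary = ., ?, !           # these characters insert an extra position gap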
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted into the index. A common request is to match singular and plural forms of words. For this, morphology processors can be used.
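Inside a table definition, these might be sketched as:

min_word_len = 3        # words shorter than 3 characters are not indexed
morphology = stem_en    # English stemming, so "shoes" can match "shoe"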
Going further, we might want one word to be matched as another because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another.
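A sketch of the word forms setup (the file name and mappings are illustrative):

wordforms = wordforms.txt

where wordforms.txt contains one mapping per line:

walks > walk
core 2 duo > c2d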
Very common words can have unwanted effects on searching, mostly because their sheer frequency makes their doc/hit lists expensive to process. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.
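A sketch of enabling stop words, either from a custom file or from one of the built-in per-language lists:

stopwords = stopwords.txt
# or, alternatively, the built-in English list:
stopwords = en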
A more advanced form of blacklisting is bigrams, which allows creating a special token for a pair of a common ("bigram") word and an uncommon word. This can speed up phrase searches involving common words several times over.
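A sketch of the bigram settings, assuming a hand-picked list of frequent words (the list is illustrative):

bigram_freq_words = the, a, of, in
bigram_index = both_freq    # index extra tokens for adjacent pairs where both words are frequent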
When indexing HTML content, it's important not to index the HTML tags themselves, as they can introduce a lot of "noise" into the index. HTML stripping can be used and can be configured to strip tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
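A sketch of the HTML stripping settings (the attribute and element lists are illustrative):

html_strip = 1                              # strip HTML markup from incoming text
html_index_attrs = img=alt,title; a=title   # but index the values of these attributes
html_remove_elements = style, script        # and drop the content of these elements entirely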